Okay.
I apologize for being late.
So, we're still looking at, you could say, classical natural language processing.
One of the big things in there is measuring success.
We talked about precision and recall being the most commonly used measures.
The thing you should probably realize is that precision and recall aren't quite mathematically
dual.
We're always looking at the rate of true positives, but over two different denominators:
recall divides by the true positives plus the false negatives, precision by the true positives
plus the false positives.
So the denominators might seem a bit unsystematic.
Normally if we have two things like this in a pair, then they're also mathematically dual, but these
are not.
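As a quick refresher, here's a minimal sketch of the two measures, plus F1 as one example of a combined measure, computed directly from raw counts; the function names and example numbers are just illustrative:

```python
def precision(tp: int, fp: int) -> float:
    # Of everything we labeled positive, how much was actually positive?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of everything that is actually positive, how much did we find?
    return tp / (tp + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    # Harmonic mean of precision and recall, one of the combined measures.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example: 8 true positives, 2 false positives, 4 false negatives.
print(precision(8, 2))  # 0.8
print(recall(8, 4))     # 0.666...
print(f1(8, 2, 4))      # 0.727...
```

Note how the numerator is the true positives in both cases, while the denominators differ; that's exactly the asymmetry mentioned above.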
There are, I would say, about twelve or so, probably even sixteen, measures that combine true positives,
false positives, false negatives and so on in various ways.
Wikipedia has a good introduction to those.
Now you should try and wrap your head around why we want exactly those, why those are
the interesting ones for the particular case of binary classification.
Of course, if you have larger classification tasks with more classes, then you can
generalize those by having a separate binary classifier for every one of those classes.
So precision and recall per class actually apply here, as in the sketch below.
So this is something to remember.
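To make the per-class generalization concrete, here's a minimal sketch using scikit-learn's precision_recall_fscore_support; the three-class labels and predictions are made up for illustration:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold labels and predictions for a three-class task.
y_true = ["cat", "dog", "dog", "bird", "cat", "bird", "dog"]
y_pred = ["cat", "dog", "cat", "bird", "cat", "dog", "dog"]

# average=None returns one precision/recall/F1 value per class, i.e.
# each class is scored as its own one-vs-rest binary classification.
labels = ["bird", "cat", "dog"]
p, r, f, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None
)
for label, prec, rec in zip(labels, p, r):
    print(f"{label}: precision={prec:.2f} recall={rec:.2f}")
```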
We looked at information retrieval.
The idea is, there's a couple of ideas; one is that you basically vectorize the documents.
The naive way of doing that is basically the word frequency vector.
And there are less naive ways of doing that, where you take frequency analysis,
which as always is just counting, into account, to make more specific words have a higher
impact than less specific words.
And that's really what this TF-IDF stuff does.
That's kind of the, I would say, baseline information retrieval.
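To make that concrete, here's a minimal hand-rolled sketch of TF-IDF weighting over a toy corpus; the documents are made up, and the exact IDF variant (log of corpus size over document frequency) is one assumption among several common choices:

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))

def tf_idf(doc):
    # Weight each word by term frequency times inverse document frequency.
    tf = Counter(doc)
    return {word: count * math.log(n_docs / df[word])
            for word, count in tf.items()}

# Specific words like "cat" and "mat" now outweigh frequent, unspecific
# words like "the", which appears in most of the documents.
print(tf_idf(docs[0]))
```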
Basically, if you look at web search engines, and typically at most information retrieval systems,
whatever they are based on, they have these stages.
The first one is that you harvest the information objects; for a web search engine that is
essentially crawling the web and storing the whole web on your servers.
Then you typically want to clean them up somehow, do a lot of pre-processing.
And then you do some kind of vectorization, which allows you to do this cosine-similarity
retrieval; see the sketch after this overview.
And then, either during that, for instance in the vector construction, or later, as in, say,
PageRank, you try to somehow weave this notion of relevance, relevance to the
average information need, into the system.
And depending on what the information needs are, and on what the information objects
are, there are quite a lot of variations on these.
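Here's a minimal sketch of that retrieval step: query and documents are vectorized the same way, here as raw word-count vectors for simplicity (a real system would plug in the TF-IDF weights from above), and documents are ranked by cosine similarity to the query. The toy corpus and query are made up:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
query = "cat on a mat".split()

def cosine(u: Counter, v: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

q = Counter(query)
# Rank documents by similarity to the query; the top hits are the
# "links to information objects" that the system returns.
ranked = sorted(range(len(docs)), key=lambda i: cosine(q, Counter(docs[i])), reverse=True)
for i in ranked:
    print(f"{cosine(q, Counter(docs[i])):.3f}", " ".join(docs[i]))
```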
But, and that's very important, information retrieval usually just gives you links to the
information objects.
And then it's up to either a later stage of processing or the human to somehow extract
or even combine information to satisfy the information need.
So information retrieval is typically either something that's directly
addressed to humans, who are very good at information processing, or a first stage
in something more interesting.
If you think about the NLP task of, say, question answering, such systems typically have an
information retrieval component as their first stage.